-
-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add/asyncio aiohttp acceleration #52
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #52 +/- ##
===========================================
+ Coverage 76.76% 90.75% +13.98%
===========================================
Files 12 12
Lines 383 400 +17
===========================================
+ Hits 294 363 +69
+ Misses 89 37 -52
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good start! I had some questions, and then when there is a relatively ready version I'd like to test it locally and compare the times.
@@ -12,6 +12,7 @@ and **Merged pull requests**. Critical items to know are: | |||
Referenced versions in headers are tagged on Github, in parentheses are for pypi. | |||
|
|||
## [vxx](https://github.com/urlstechie/urlschecker-python/tree/master) (master) | |||
- accelerate code using asyncio and aiohttp (0.0.23) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a version bump,up to 0.1.0 since we would be fundamentally changing the core library and making it un-usable for older python versions.
@@ -10,6 +10,11 @@ and then test for and report broken links. If you are interesting in using | |||
this as a GitHub action, see [urlchecker-action](https://github.com/urlstechie/urlchecker-action). There are also container | |||
bases available on [quay.io/urlstechie/urlchecker](https://quay.io/repository/urlstechie/urlchecker?tab=tags). | |||
|
|||
## Module Dependencies | |||
**Versions <= 0.0.22** are built around the [Requests](https://requests.readthedocs.io/en/master/) library whereas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
**Versions <= 0.0.22** are built around the [Requests](https://requests.readthedocs.io/en/master/) library whereas | |
**versions <= 0.0.22** are built around the [Requests](https://requests.readthedocs.io/en/master/) library whereas |
@@ -88,7 +93,7 @@ optional arguments: | |||
--save SAVE Path to a csv file to save results to. | |||
--retry-count RETRY_COUNT | |||
retry count upon failure (defaults to 2, one retry). | |||
--timeout TIMEOUT timeout (seconds) to provide to the requests library | |||
--timeout TIMEOUT timeout (minutes) to provide to the aiohttp library |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not the change the API here - if the user provides seconds, we should just convert to minutes for the library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I tried to keep the numbers the same, that's why I only changed the unit, but I will fix this.
@@ -53,18 +53,18 @@ def test_check_file_type(file_path, file_types): | |||
["tests/test_files/sample_test_file.md", "tests/test_files/sample_test_file.py"], | |||
) | |||
@pytest.mark.parametrize( | |||
"white_list_patterns", [["[.py]"], ["[.md]"], ["tests/test_file"]] | |||
"exclude_patterns", [["[.py]"], ["[.md]"], ["tests/test_file"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"exclude_patterns", [["[.py]"], ["[.md]"], ["tests/test_file"]] | |
"exclude_patterns", [["[.]py$"], ["[.]md$"], ["tests/test_file"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain this change to me please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! So your current regular expressions are matching exactly .py. We usually only need brackets when we want an exact match of a character that could be a regular expression (in the example above, the period). So I moved the brackets around just the periods, left the letters as is, and added a $ to indicate we only want to match the end of the line (e.g., we wouldn't want to match filename.python-bindings or .mdl
# When we break from while, we record final response | ||
self.record_response(url, response) | ||
|
||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try to not have this extra exception block. We should be aware of issues going on and try to catch them explicitly or otherwise address them.
# When we break from while, we record final response | ||
self.record_response(url, response) | ||
# handle different py versions support | ||
if (3, 7) <= sys.version_info: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well I tested the code using python 3.6 and above. asyncio.run()
was introduced in Python 3.7 and using the other alternative caused either warnings or errors so this was the best alternative imo to keep support for 3.6 and still provide a working code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's step back a second - what are the advantages of making this change? So far, I'm seeing that it introduces more potential for error because of the overlapping requests (leading to 429), adds an additional dependency (requests was just one), it hard to maintain between changing versions, and makes it impossible to use for Python < 3.5. Before you do a ton of work, we should seriously consider if this change is improving the library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well there are clearly many drawbacks as you mention. The clear pro for this change is the huge speed difference. I cannot think of something else for now.
Co-authored-by: Vanessasaurus <[email protected]>
Co-authored-by: Vanessasaurus <[email protected]>
I don't see where this comment is but I wanted to respond:
This is actually my original concern - most servers implement rate limiting, so if you do too many requests too soon, you'll get a 429 and the ip address is blocked. We would want to speed up many things with aiohttp potentially, but checking (and retrying) urls is very likely to lead to being blocked (and the serial approach with a timeout is better). So this is my main concern for wanting to update this to a newer technology, aside from not supporting under python 3.5, we actually add a feature that makes it more likely to have an error. |
You are right about this, that's why I asked #52 if we should add this as an argument activated feature. |
This PR :
@vsoch I tried my best to make sure that all the features are working similarly to previous versions. However, due to the libs differences from requests, the
timeout
arg is now defined in minutes.This PR will need some thorough testing due to the implemented major changes in the core of urlchecker. However, no stress so take your time 😉